NSF PAR Search | NSF Public Access Repository

InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

Trienes, Jan; Joseph, Sebastian; Schlötterer, Jörg; Seifert, Christin; Lo, Kyle; Xu, Wei; Wallace, Byron C; Li, Jessy (August 2024, Association for Computational Linguistics)

Full Text Available
Decomposing Complex Queries for Tip-of-the-tongue Retrieval

https://doi.org/10.18653/v1/2023.findings-emnlp.367

Lin, Kevin; Lo, Kyle; Gonzalez, Joseph; Klein, Dan (December 2023, Association for Computational Linguistics)

Full Text Available
InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

https://doi.org/10.18653/v1/2024.acl-long.234

Trienes, Jan; Joseph, Sebastian; Schlötterer, Jörg; Seifert, Christin; Lo, Kyle; Xu, Wei; Wallace, Byron; Li, Junyi Jessy (January 2024, Association for Computational Linguistics)

Full Text Available
Decomposing Complex Queries for Tip-of-the-tongue Retrieval

Lin, Kevin; Lo, Kyle; Gonzalez, Joseph E; Klein, Dan (May 2023, arXiv)

Full Text Available
DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al. (April 2025, https://doi.org/10.48550/arXiv.2406.11794)

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. In their baseline experiments, the authors find that model-based filtering is critical for assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B-parameter model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. This is a 6.6 percentage point improvement on MMLU over MAP-Neo, the previous state of the art in open-data LMs, achieved with 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively) and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
Free, publicly-accessible full text available April 21, 2026
